PAC-MDP learning with knowledge-based admissible models
Abstract
PAC-MDP algorithms address the exploration-exploitation problem of reinforcement learning agents in an effective way, guaranteeing that, with high probability, the algorithm performs near-optimally for all but a polynomial number of steps. The performance of these algorithms can be further improved by incorporating domain knowledge to guide the learning process. In this paper, we propose a framework for using partial knowledge about the effects of actions in a theoretically well-founded way. Empirical evaluation shows that our proposed method is more efficient than reward shaping, an alternative approach to incorporating background knowledge. Our solution is also very competitive with the Bayesian Exploration Bonus (BEB) algorithm. BEB is not PAC-MDP; however, it can exploit domain knowledge via informative priors. We show how to use the same kind of knowledge in the PAC-MDP framework in a way that preserves all the theoretical guarantees of PAC-MDP learning.
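As a rough illustration of how partial knowledge about action effects can be folded into PAC-MDP learning, the following Python sketch modifies an R-MAX-style optimistic planner so that an unknown state-action pair may only transition to successors permitted by the background knowledge. All names here (possible_succ, m_known, and so on) are hypothetical; this is a sketch of the general idea under stated assumptions, not the algorithm proposed in the paper.

```python
import numpy as np

# Hypothetical sketch: R-MAX-style optimistic planning in which partial
# knowledge about action effects (possible_succ) restricts where an
# unknown state-action pair may optimistically lead. The model stays
# admissible: it never underestimates the true optimal value.
def optimistic_value_iteration(n_states, n_actions, counts, model_r, model_p,
                               possible_succ, gamma=0.95, m_known=10,
                               r_max=1.0, iters=200):
    """counts[s][a]: visit count; model_r[s][a]: empirical mean reward;
    model_p[s][a]: empirical next-state distribution (numpy array of
    length n_states); possible_succ[s][a]: non-empty set of successor
    states allowed by the background knowledge."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                if counts[s][a] >= m_known:
                    # Known pair: plan with the learned model.
                    Q[s, a] = model_r[s][a] + gamma * model_p[s][a] @ V
                else:
                    # Unknown pair: plain R-MAX would grant the maximal
                    # value r_max / (1 - gamma) here; the admissible model
                    # only allows jumps to permitted successors, which
                    # yields a tighter but still optimistic estimate.
                    Q[s, a] = r_max + gamma * max(V[s2]
                                                  for s2 in possible_succ[s][a])
        V = Q.max(axis=1)
    return V, Q
```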
Similar resources
Lower PAC bound on Upper Confidence Bound-based Q-learning with examples
Recently, there has been significant progress in understanding reinforcement learning in Markov decision processes (MDPs). We focus on improving Q-learning and analyze its sample complexity. We investigate the performance of tabular Q-learning, approximate Q-learning, and UCB-based Q-learning. We also derive a lower PAC bound of Ω(|S||A|² ln(|A|/δ)) for UCB-based Q-learning. Two tasks, Ca...
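To make the object of the bound concrete, here is a minimal Python sketch of UCB-style Q-learning: the agent acts greedily with respect to Q plus an exploration bonus that shrinks with the visit count. The bonus form and the constant c are illustrative assumptions, not the exact algorithm analyzed in that paper.

```python
import math
from collections import defaultdict

# Illustrative UCB-style Q-learning: act greedily with respect to Q plus
# an exploration bonus that shrinks with the visit count.
def ucb_action(Q, counts, state, actions, t, c=1.0):
    def score(a):
        n = counts[(state, a)]
        if n == 0:
            return float("inf")  # try every action at least once
        return Q[(state, a)] + c * math.sqrt(math.log(t + 1) / n)
    return max(actions, key=score)

def q_update(Q, counts, s, a, r, s2, actions, gamma=0.95):
    counts[(s, a)] += 1
    alpha = 1.0 / counts[(s, a)]  # decaying learning rate
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Typical setup: Q = defaultdict(float), counts = defaultdict(int).
```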
Probably Approximately Correct (PAC) Exploration in Reinforcement Learning
Abstract of the dissertation "Probably Approximately Correct (PAC) Exploration in Reinforcement Learning" by Alexander L. Strehl (dissertation director: Michael Littman). Reinforcement Learning (RL) in finite state and action Markov Decision Processes is studied with an emphasis on the well-studied exploration problem. We provide a general RL framework that applies to all results in this thesis and to other ...
PAC Reinforcement Learning Bounds for RTDP and Rand-RTDP Technical Report
Real-time Dynamic Programming (RTDP) is a popular algorithm for planning in a Markov Decision Process (MDP). It can also be viewed as a learning algorithm, where the agent improves the value function and policy while acting in an MDP. It has been empirically observed that an RTDP agent generally performs well when viewed this way, but past theoretical results have been limited to asymptotic con...
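For context, a single RTDP trial can be sketched as below, assuming a given model: P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the immediate reward; both names are hypothetical. Value estimates should be initialized optimistically by the caller, and each trial backs up only the states actually visited.

```python
import random

# Minimal RTDP trial under the assumptions stated above. V should be
# initialized optimistically so that greedy action selection explores.
def rtdp_trial(V, P, R, start, goal, gamma=1.0, max_steps=1000):
    s = start
    for _ in range(max_steps):
        if s == goal:
            break
        # Bellman backup at the current state only.
        qs = {a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
              for a in P[s]}
        a_best = max(qs, key=qs.get)
        V[s] = qs[a_best]
        # Sample the next state from the chosen action's outcomes.
        u, acc = random.random(), 0.0
        for p, s2 in P[s][a_best]:
            acc += p
            if u <= acc:
                s = s2
                break
    return V
```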
Increasingly Cautious Optimism for Practical PAC-MDP Exploration
An exploration strategy is an essential part of a learning agent in model-based reinforcement learning. R-MAX and V-MAX are PAC-MDP strategies proven to have polynomial sample complexity; yet their exploration behavior tends to be overly cautious in practice. We propose the principle of Increasingly Cautious Optimism (ICO) to automatically cut off unnecessarily cautious exploration, and apply ICO t...
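One way to picture the ICO principle, assuming an R-MAX-style knownness test, is a visit-count threshold that starts small and grows over time toward the conservative PAC-MDP value; the schedule below is an illustrative guess, not the schedule from the paper.

```python
# Hedged sketch of the ICO idea: the knownness threshold starts small,
# so early exploration is cut short, and grows toward the conservative
# PAC-MDP value m_final. The growth schedule is illustrative only.
def ico_threshold(t, m_final, growth=0.01):
    return min(m_final, 1 + int(growth * t))

def is_known(counts, s, a, t, m_final):
    # Unknown pairs would receive the optimistic R-MAX value in planning.
    return counts[(s, a)] >= ico_threshold(t, m_final)
```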
Publication date: 2010